Iteration Three – Formula 1
Kaloyan Rakov
Formula 1 is a motorsport championship contested by 10 teams with 2 drivers each. The season awards 2 titles: the Drivers' Championship (for individual drivers) and the Constructors' Championship (for teams). In this notebook I will dive into the factors that affect the drivers' performances and build a machine-learning model to predict them.
from IPython.display import Image
Image(filename='Mercedes.jpg')
Data provisioning
What data are we working with:
For this project I am working with multiple datasets. One of them is the data from last season, as it gives important context about the drivers and their past performances. "Formula1_2024season_raceResults.csv" contains every result every driver produced across all races on the 2024 calendar. The other datasets are collected per race and are based on the drivers' qualifying results for that race. It's important to look at each race individually, since performances vary considerably from track to track; otherwise we would be oversimplifying the problem. The qualifying data is saved as QUALIFYING.csv. I am also using the drivers' starting positions (STARTING_GRID.csv) and the final results (RESULTS.csv) so we can compare the predictions to the real results.
The data from the 2024 season was available here: https://github.com/toUpperCase78/formula1-datasets/blob/master/Formula1_2024season_raceResults.csv
RESULTS.csv, STARTING_GRID.csv and QUALIFYING.csv were scraped from the official Formula 1 website; the scraping is documented here: Data Scraping. For a detailed description of the datasets, I have provided a Data Dictionary here: Data Dictionary
Loading the data:
import pandas as pd
df = pd.read_csv('Formula1_2024season_raceResults.csv')
Sampling the data:
Let's get a better idea of what the dataset looks like by printing the first 5 lines...
df.head(5)
| | Track | Position | No | Driver | Team | Starting Grid | Laps | Time/Retired | Points | Set Fastest Lap | Fastest Lap Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bahrain | 1 | 1 | Max Verstappen | Red Bull Racing Honda RBPT | 1 | 57 | 1:31:44.742 | 26 | Yes | 1:32.608 |
| 1 | Bahrain | 2 | 11 | Sergio Perez | Red Bull Racing Honda RBPT | 5 | 57 | +22.457 | 18 | No | 1:34.364 |
| 2 | Bahrain | 3 | 55 | Carlos Sainz | Ferrari | 4 | 57 | +25.110 | 15 | No | 1:34.507 |
| 3 | Bahrain | 4 | 16 | Charles Leclerc | Ferrari | 2 | 57 | +39.669 | 12 | No | 1:34.090 |
| 4 | Bahrain | 5 | 63 | George Russell | Mercedes | 3 | 57 | +46.788 | 10 | No | 1:35.065 |
... and the last 5 lines
df.tail(5)
| | Track | Position | No | Driver | Team | Starting Grid | Laps | Time/Retired | Points | Set Fastest Lap | Fastest Lap Time |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 474 | Abu Dhabi | 16 | 20 | Kevin Magnussen | Haas Ferrari | 14 | 57 | +1 lap | 0 | Yes | 1:25.637 |
| 475 | Abu Dhabi | 17 | 30 | Liam Lawson | RB Honda RBPT | 12 | 55 | DNF | 0 | No | 1:28.751 |
| 476 | Abu Dhabi | NC | 77 | Valtteri Bottas | Kick Sauber Ferrari | 9 | 30 | DNF | 0 | No | 1:29.482 |
| 477 | Abu Dhabi | NC | 43 | Franco Colapinto | Williams Mercedes | 20 | 26 | DNF | 0 | No | 1:29.411 |
| 478 | Abu Dhabi | NC | 11 | Sergio Perez | Red Bull Racing Honda RBPT | 10 | 0 | DNF | 0 | No | NaN |
The race ends when the leader completes the required number of laps, after which every driver behind him finishes their current lap. That's why only the winner's time is recorded in absolute form ("H:MM:SS.sss"), while everyone behind is recorded as a gap ("+SS.sss"). In order to use the data, I will need to convert all the times to the same format. Some drivers, however, get lapped by the winner (the winner gains a full lap on them), which means that when the race finishes they haven't completed the full number of laps; they are recorded as "+1 lap", "+2 laps", etc. There are also drivers listed with NC (Not Classified) as their final position: this happens when a driver starts the race but doesn't complete at least 90% of it, often because of accidents or mechanical failures. In the results these appear as DNF (Did Not Finish), DSQ (Disqualified) or, rarely, DNS (Did Not Start).
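The cases above can be summarised in a small sketch (this is only an illustration, not the preprocessing itself; the category names are my own):

```python
import re

def classify_result(time_str):
    """Rough sketch: bucket a raw Time/Retired value into the cases above."""
    s = str(time_str).strip()
    if s in {"DNF", "DSQ", "DNS"}:
        return "retired"                     # accident, disqualification, no start
    if re.fullmatch(r"\+\d+ [Ll]aps?", s):
        return "lapped"                      # e.g. "+1 lap", "+2 Laps"
    if s.startswith("+"):
        return "gap"                         # e.g. "+22.457" behind the winner
    if re.fullmatch(r"\d+:\d{2}:\d{2}\.\d{3}", s):
        return "winner"                      # absolute "H:MM:SS.sss"
    return "unknown"

print(classify_result("1:31:44.742"), classify_result("+22.457"),
      classify_result("+1 lap"), classify_result("DNF"))
# winner gap lapped retired
```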
Preprocessing
Let's make sure that all the drivers who completed the required number of laps have their times recorded in the same format: "H:MM:SS.sss" (here we'll call that "Absolute Time"). Lapped drivers are recorded as PF (Premature Finish) plus the number of laps they were down. Drivers who didn't finish keep their original status (DNF, DSQ or DNS).
absolute_times = []
for _, row in df.iterrows():
    time_str = str(row["Time/Retired"]).strip()
    if time_str.startswith("+"):
        # Lapped drivers: "+1 lap" / "+2 laps" -> "PF +1 lap" / "PF +2 laps"
        if time_str[1:].split()[0].isdigit():
            absolute_times.append(f"PF {time_str}")
            continue
        # Otherwise the value is a gap behind the winner, e.g. "+22.457"
        winner_mask = (df["Track"] == row["Track"]) & (df["Position"].astype(str).str.strip() == "1")
        if winner_mask.any():
            winner_time_str = str(df.loc[winner_mask, "Time/Retired"].values[0])
            if ":" in winner_time_str:
                winner_parts = list(map(float, winner_time_str.split(":")))
                if len(winner_parts) == 3:  # "H:MM:SS.sss"
                    winner_total = winner_parts[0] * 3600 + winner_parts[1] * 60 + winner_parts[2]
                else:                       # "MM:SS.sss"
                    winner_total = winner_parts[0] * 60 + winner_parts[1]
                try:
                    delta = float(time_str[1:])
                    absolute_total = winner_total + delta
                    hours = int(absolute_total // 3600)
                    minutes = int((absolute_total % 3600) // 60)
                    seconds = absolute_total % 60
                    absolute_times.append(f"{hours}:{minutes:02d}:{seconds:06.3f}")
                except ValueError:
                    absolute_times.append("Time calculation error")
            else:
                absolute_times.append("Invalid winner time")
        else:
            absolute_times.append("No winner found")
    else:
        # Winner's absolute time, or a DNF/DSQ/DNS status, carries over as-is
        absolute_times.append(time_str)
df["Absolute Time"] = absolute_times
Let's take a look at the data now. There is a new column, "Absolute Time", in which all times follow the same format.
df.head(10)
| | Track | Position | No | Driver | Team | Starting Grid | Laps | Time/Retired | Points | Set Fastest Lap | Fastest Lap Time | Absolute Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bahrain | 1 | 1 | Max Verstappen | Red Bull Racing Honda RBPT | 1 | 57 | 1:31:44.742 | 26 | Yes | 1:32.608 | 1:31:44.742 |
| 1 | Bahrain | 2 | 11 | Sergio Perez | Red Bull Racing Honda RBPT | 5 | 57 | +22.457 | 18 | No | 1:34.364 | 1:32:07.199 |
| 2 | Bahrain | 3 | 55 | Carlos Sainz | Ferrari | 4 | 57 | +25.110 | 15 | No | 1:34.507 | 1:32:09.852 |
| 3 | Bahrain | 4 | 16 | Charles Leclerc | Ferrari | 2 | 57 | +39.669 | 12 | No | 1:34.090 | 1:32:24.411 |
| 4 | Bahrain | 5 | 63 | George Russell | Mercedes | 3 | 57 | +46.788 | 10 | No | 1:35.065 | 1:32:31.530 |
| 5 | Bahrain | 6 | 4 | Lando Norris | McLaren Mercedes | 7 | 57 | +48.458 | 8 | No | 1:34.476 | 1:32:33.200 |
| 6 | Bahrain | 7 | 44 | Lewis Hamilton | Mercedes | 9 | 57 | +50.324 | 6 | No | 1:34.722 | 1:32:35.066 |
| 7 | Bahrain | 8 | 81 | Oscar Piastri | McLaren Mercedes | 8 | 57 | +56.082 | 4 | No | 1:34.983 | 1:32:40.824 |
| 8 | Bahrain | 9 | 14 | Fernando Alonso | Aston Martin Aramco Mercedes | 6 | 57 | +74.887 | 2 | No | 1:34.199 | 1:32:59.629 |
| 9 | Bahrain | 10 | 18 | Lance Stroll | Aston Martin Aramco Mercedes | 12 | 57 | +93.216 | 1 | No | 1:35.632 | 1:33:17.958 |
df.tail(10)
| | Track | Position | No | Driver | Team | Starting Grid | Laps | Time/Retired | Points | Set Fastest Lap | Fastest Lap Time | Absolute Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 469 | Abu Dhabi | 11 | 23 | Alexander Albon | Williams Mercedes | 18 | 57 | +1 lap | 0 | No | 1:29.438 | PF +1 lap |
| 470 | Abu Dhabi | 12 | 22 | Yuki Tsunoda | RB Honda RBPT | 11 | 57 | +1 lap | 0 | No | 1:29.200 | PF +1 lap |
| 471 | Abu Dhabi | 13 | 24 | Guanyu Zhou | Kick Sauber Ferrari | 15 | 57 | +1 lap | 0 | No | 1:27.982 | PF +1 lap |
| 472 | Abu Dhabi | 14 | 18 | Lance Stroll | Aston Martin Aramco Mercedes | 13 | 57 | +1 lap | 0 | No | 1:28.604 | PF +1 lap |
| 473 | Abu Dhabi | 15 | 61 | Jack Doohan | Alpine Renault | 17 | 57 | +1 lap | 0 | No | 1:29.121 | PF +1 lap |
| 474 | Abu Dhabi | 16 | 20 | Kevin Magnussen | Haas Ferrari | 14 | 57 | +1 lap | 0 | Yes | 1:25.637 | PF +1 lap |
| 475 | Abu Dhabi | 17 | 30 | Liam Lawson | RB Honda RBPT | 12 | 55 | DNF | 0 | No | 1:28.751 | DNF |
| 476 | Abu Dhabi | NC | 77 | Valtteri Bottas | Kick Sauber Ferrari | 9 | 30 | DNF | 0 | No | 1:29.482 | DNF |
| 477 | Abu Dhabi | NC | 43 | Franco Colapinto | Williams Mercedes | 20 | 26 | DNF | 0 | No | 1:29.411 | DNF |
| 478 | Abu Dhabi | NC | 11 | Sergio Perez | Red Bull Racing Honda RBPT | 10 | 0 | DNF | 0 | No | NaN | DNF |
When the time isn't available, "Absolute Time" holds either the retirement status (DNF/DNS) or the PF marker with the number of laps the driver was down.
Formula 1 is an expensive sport that relies heavily on sponsorships; sponsors change quite often, and the teams' names change with them. There are two types of teams: works teams (aka factory teams) and customer teams. Works teams produce their own cars and engines (Ferrari, Mercedes, etc.), while customer teams buy their engines (or other parts) from the works teams. If a team like Haas uses an engine supplied by Ferrari, it carries the Ferrari name in its official entry. That's how we end up with names like Aston Martin Aramco Mercedes and Kick Sauber Ferrari. To avoid confusion I will address the teams by their short, simplest names. Some teams have also renamed themselves for sponsorship reasons, so I will use their 2025 names.
team_name_mapping = {
'McLaren Mercedes': 'McLaren',
'Mercedes': 'Mercedes',
'Red Bull Racing Honda RBPT': 'Red Bull Racing',
'Ferrari': 'Ferrari',
'RB Honda RBPT': 'Racing Bulls',
'Williams Mercedes': 'Williams',
'Haas Ferrari': 'Haas',
'Kick Sauber Ferrari': 'Kick Sauber',
'Aston Martin Aramco Mercedes': 'Aston Martin',
'Alpine Renault': 'Alpine'
}
df['Team'] = df['Team'].replace(team_name_mapping)
Now we can see that our dataset is using the current team names:
print(df['Team'].unique())
['Red Bull Racing' 'Ferrari' 'Mercedes' 'McLaren' 'Aston Martin' 'Kick Sauber' 'Haas' 'Racing Bulls' 'Williams' 'Alpine']
Cleaning
We need to make sure that there are no empty rows or invalid data. Here is what the data looks like before cleaning:
df.head(15)
| | Track | Position | No | Driver | Team | Starting Grid | Laps | Time/Retired | Points | Set Fastest Lap | Fastest Lap Time | Absolute Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bahrain | 1 | 1 | Max Verstappen | Red Bull Racing | 1 | 57 | 1:31:44.742 | 26 | Yes | 1:32.608 | 1:31:44.742 |
| 1 | Bahrain | 2 | 11 | Sergio Perez | Red Bull Racing | 5 | 57 | +22.457 | 18 | No | 1:34.364 | 1:32:07.199 |
| 2 | Bahrain | 3 | 55 | Carlos Sainz | Ferrari | 4 | 57 | +25.110 | 15 | No | 1:34.507 | 1:32:09.852 |
| 3 | Bahrain | 4 | 16 | Charles Leclerc | Ferrari | 2 | 57 | +39.669 | 12 | No | 1:34.090 | 1:32:24.411 |
| 4 | Bahrain | 5 | 63 | George Russell | Mercedes | 3 | 57 | +46.788 | 10 | No | 1:35.065 | 1:32:31.530 |
| 5 | Bahrain | 6 | 4 | Lando Norris | McLaren | 7 | 57 | +48.458 | 8 | No | 1:34.476 | 1:32:33.200 |
| 6 | Bahrain | 7 | 44 | Lewis Hamilton | Mercedes | 9 | 57 | +50.324 | 6 | No | 1:34.722 | 1:32:35.066 |
| 7 | Bahrain | 8 | 81 | Oscar Piastri | McLaren | 8 | 57 | +56.082 | 4 | No | 1:34.983 | 1:32:40.824 |
| 8 | Bahrain | 9 | 14 | Fernando Alonso | Aston Martin | 6 | 57 | +74.887 | 2 | No | 1:34.199 | 1:32:59.629 |
| 9 | Bahrain | 10 | 18 | Lance Stroll | Aston Martin | 12 | 57 | +93.216 | 1 | No | 1:35.632 | 1:33:17.958 |
| 10 | Bahrain | 11 | 24 | Guanyu Zhou | Kick Sauber | 17 | 56 | +1 lap | 0 | No | 1:35.458 | PF +1 lap |
| 11 | Bahrain | 12 | 20 | Kevin Magnussen | Haas | 15 | 56 | +1 lap | 0 | No | 1:35.570 | PF +1 lap |
| 12 | Bahrain | 13 | 3 | Daniel Ricciardo | Racing Bulls | 14 | 56 | +1 lap | 0 | No | 1:35.163 | PF +1 lap |
| 13 | Bahrain | 14 | 22 | Yuki Tsunoda | Racing Bulls | 11 | 56 | +1 lap | 0 | No | 1:35.833 | PF +1 lap |
| 14 | Bahrain | 15 | 23 | Alexander Albon | Williams | 13 | 56 | +1 lap | 0 | No | 1:35.723 | PF +1 lap |
Let's check if there are any rows with empty data:
df.isnull().sum()
Track 0 Position 0 No 0 Driver 0 Team 0 Starting Grid 0 Laps 0 Time/Retired 0 Points 0 Set Fastest Lap 0 Fastest Lap Time 16 Absolute Time 0 dtype: int64
We can see that the only column that has any empty data is "Fastest Lap Time". Let's take a closer look:
missing_fastest_lap = df[df["Fastest Lap Time"].isnull()]
display(missing_fastest_lap)
| | Track | Position | No | Driver | Team | Starting Grid | Laps | Time/Retired | Points | Set Fastest Lap | Fastest Lap Time | Absolute Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | Saudi Arabia | NC | 10 | Pierre Gasly | Alpine | 18 | 1 | DNF | 0 | No | NaN | DNF |
| 77 | Japan | NC | 3 | Daniel Ricciardo | Racing Bulls | 11 | 0 | DNF | 0 | No | NaN | DNF |
| 78 | Japan | NC | 23 | Alexander Albon | Williams | 14 | 0 | DNF | 0 | No | NaN | DNF |
| 155 | Monaco | NC | 31 | Esteban Ocon | Alpine | 11 | 0 | DNF | 0 | No | NaN | DNF |
| 156 | Monaco | NC | 11 | Sergio Perez | Red Bull Racing | 16 | 0 | DNF | 0 | No | NaN | DNF |
| 157 | Monaco | NC | 27 | Nico Hulkenberg | Haas | 19 | 0 | DNF | 0 | No | NaN | DNF |
| 158 | Monaco | NC | 20 | Kevin Magnussen | Haas | 20 | 0 | DNF | 0 | No | NaN | DNF |
| 238 | Great Britain | NC | 10 | Pierre Gasly | Alpine | 19 | 0 | DNS | 0 | No | NaN | DNS |
| 378 | United States | NC | 44 | Lewis Hamilton | Mercedes | 17 | 1 | DNF | 0 | No | NaN | DNF |
| 397 | Mexico | NC | 23 | Alexander Albon | Williams | 9 | 0 | DNF | 0 | No | NaN | DNF |
| 398 | Mexico | NC | 22 | Yuki Tsunoda | Racing Bulls | 11 | 0 | DNF | 0 | No | NaN | DNF |
| 416 | Brazil | NC | 23 | Alexander Albon | Williams | 7 | 0 | DNS | 0 | No | NaN | DNS |
| 417 | Brazil | NC | 18 | Lance Stroll | Aston Martin | 10 | 0 | DNS | 0 | No | NaN | DNS |
| 457 | Qatar | NC | 43 | Franco Colapinto | Williams | 19 | 0 | DNF | 0 | No | NaN | DNF |
| 458 | Qatar | NC | 31 | Esteban Ocon | Alpine | 20 | 0 | DNF | 0 | No | NaN | DNF |
| 478 | Abu Dhabi | NC | 11 | Sergio Perez | Red Bull Racing | 10 | 0 | DNF | 0 | No | NaN | DNF |
All of these rows have NC as the position and were retired with DNF or DNS. They are the result of crashes, failures to start or other emergencies, which explains why they don't have a "Fastest Lap Time". We could either delete these rows or replace "NaN" with "No time". I think the latter is the better option, since I want to keep as much data as possible for the analysis later on.
df["Fastest Lap Time"] = df["Fastest Lap Time"].fillna("No time")
missing_fastest_lap = df[df["Fastest Lap Time"] == "No time"]
display(missing_fastest_lap)
| | Track | Position | No | Driver | Team | Starting Grid | Laps | Time/Retired | Points | Set Fastest Lap | Fastest Lap Time | Absolute Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 39 | Saudi Arabia | NC | 10 | Pierre Gasly | Alpine | 18 | 1 | DNF | 0 | No | No time | DNF |
| 77 | Japan | NC | 3 | Daniel Ricciardo | Racing Bulls | 11 | 0 | DNF | 0 | No | No time | DNF |
| 78 | Japan | NC | 23 | Alexander Albon | Williams | 14 | 0 | DNF | 0 | No | No time | DNF |
| 155 | Monaco | NC | 31 | Esteban Ocon | Alpine | 11 | 0 | DNF | 0 | No | No time | DNF |
| 156 | Monaco | NC | 11 | Sergio Perez | Red Bull Racing | 16 | 0 | DNF | 0 | No | No time | DNF |
| 157 | Monaco | NC | 27 | Nico Hulkenberg | Haas | 19 | 0 | DNF | 0 | No | No time | DNF |
| 158 | Monaco | NC | 20 | Kevin Magnussen | Haas | 20 | 0 | DNF | 0 | No | No time | DNF |
| 238 | Great Britain | NC | 10 | Pierre Gasly | Alpine | 19 | 0 | DNS | 0 | No | No time | DNS |
| 378 | United States | NC | 44 | Lewis Hamilton | Mercedes | 17 | 1 | DNF | 0 | No | No time | DNF |
| 397 | Mexico | NC | 23 | Alexander Albon | Williams | 9 | 0 | DNF | 0 | No | No time | DNF |
| 398 | Mexico | NC | 22 | Yuki Tsunoda | Racing Bulls | 11 | 0 | DNF | 0 | No | No time | DNF |
| 416 | Brazil | NC | 23 | Alexander Albon | Williams | 7 | 0 | DNS | 0 | No | No time | DNS |
| 417 | Brazil | NC | 18 | Lance Stroll | Aston Martin | 10 | 0 | DNS | 0 | No | No time | DNS |
| 457 | Qatar | NC | 43 | Franco Colapinto | Williams | 19 | 0 | DNF | 0 | No | No time | DNF |
| 458 | Qatar | NC | 31 | Esteban Ocon | Alpine | 20 | 0 | DNF | 0 | No | No time | DNF |
| 478 | Abu Dhabi | NC | 11 | Sergio Perez | Red Bull Racing | 10 | 0 | DNF | 0 | No | No time | DNF |
Now let's check if there are any null values left in our data:
df.isnull().sum()
Track 0 Position 0 No 0 Driver 0 Team 0 Starting Grid 0 Laps 0 Time/Retired 0 Points 0 Set Fastest Lap 0 Fastest Lap Time 0 Absolute Time 0 dtype: int64
As we can see, there are no more null values in our dataset. It's clean, which means that now we can dive a little bit deeper into what the data actually represents. A good way of doing that is through visualisation.
Visualisation
The first thing we might want to look into is the number of races each driver has won:
import pandas as pd
import matplotlib.pyplot as plt
all_drivers = df['Driver'].unique()
wins = df[df['Position'] == '1']
win_counts = wins['Driver'].value_counts()
full_win_counts = pd.Series(0, index=all_drivers)
full_win_counts.update(win_counts)
full_win_counts = full_win_counts.sort_values(ascending=False)
plt.figure(figsize=(12, 8))
bars = plt.bar(full_win_counts.index, full_win_counts.values, color='gold', edgecolor='black')
plt.title('Wins per Driver', fontsize=16)
plt.xlabel('Driver', fontsize=12)
plt.ylabel('Number of Wins', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
for bar in bars:
height = bar.get_height()
plt.text(bar.get_x() + bar.get_width()/2., height,
f'{int(height)}',
ha='center', va='bottom')
plt.tight_layout()
plt.show()
From the graph above one might conclude that Max Verstappen exercises considerable dominance over the other drivers. Is this true? To get a more detailed perspective on how far apart the drivers really are, it's a good idea to look at their average fastest laps (the average of their best lap on each track), since looking only at wins may be deceptive.
df_laps = df.copy()
df_laps = df_laps[df_laps['Fastest Lap Time'] != 'No time']
def lap_time_to_seconds(x):
minutes, seconds = x.split(':')
return int(minutes) * 60 + float(seconds)
df_laps['Fastest Lap Time (s)'] = df_laps['Fastest Lap Time'].apply(lap_time_to_seconds)
avg_fastest_lap_driver = df_laps.groupby('Driver')['Fastest Lap Time (s)'].mean().sort_values()
def seconds_to_lap_time(seconds):
minutes = int(seconds // 60)
secs = seconds % 60
return f"{minutes}:{secs:06.3f}"
plt.figure(figsize=(12,10))
bars = avg_fastest_lap_driver.plot(kind='barh', color='orange')
plt.xlabel('Average Fastest Lap Time (seconds)')
plt.title('Average Fastest Lap Time per Driver')
plt.grid(axis='x')
plt.gca().invert_yaxis()
for index, value in enumerate(avg_fastest_lap_driver):
lap_time_formatted = seconds_to_lap_time(value)
plt.text(value + 0.5, index, lap_time_formatted, va='center')
plt.tight_layout()
plt.show()
As we can see, the difference from driver to driver is often a fraction of a second, meaning the competition is far fiercer than the number of wins per driver may suggest. With that data we can also show the number of wins per team:
all_teams = df['Team'].unique()
team_wins = df[df['Position'] == '1']
team_win_counts = team_wins['Team'].value_counts()
full_team_win_counts = pd.Series(0, index=all_teams)
full_team_win_counts.update(team_win_counts)
full_team_win_counts = full_team_win_counts.sort_values(ascending=False)
plt.figure(figsize=(12, 8))
bars = plt.bar(full_team_win_counts.index, full_team_win_counts.values,
color='#ff0d0d', edgecolor='black')
plt.title('F1 Wins by Team', fontsize=16, pad=20)
plt.xlabel('Team', fontsize=12)
plt.ylabel('Number of Wins', fontsize=12)
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y', linestyle='--', alpha=0.3)
for bar in bars:
height = bar.get_height()
plt.text(bar.get_x() + bar.get_width()/2., height,
f'{int(height)}',
ha='center', va='bottom', fontsize=10)
plt.tight_layout()
plt.show()
"Red Bull Racing" is indeed Max Verstappen's team, and since he has the most wins individually, he lifts his team in the Constructors' Championship as well. We can see that McLaren, Ferrari and Mercedes are going to be the main favourites to take the throne from Red Bull. Now let's analyse the tracks themselves. It's interesting to see which tracks are the most problematic for the drivers: which ones result in the most NCs?
nc_df = df[df['Position'] == 'NC']
nc_counts = nc_df['Track'].value_counts().sort_values(ascending=False)
plt.figure(figsize=(10,6))
nc_counts.plot(kind='bar', color='red')
plt.title('Number of NCs per Track')
plt.xlabel('Track')
plt.ylabel('Number of NCs')
plt.xticks(rotation=45, ha='right')
plt.grid(axis='y')
plt.tight_layout()
plt.show()
Here we can see that Canada and Qatar are the most problematic for drivers. As we saw before, some racers DO finish their races, but since they get lapped they don't complete the full number of laps. Let's look into who these drivers are:
all_drivers = df['Driver'].unique()
pf_laps = df[df['Absolute Time'].str.contains('PF', na=False)].copy()
pf_laps['PF Status'] = pf_laps['Absolute Time'].str.extract(r'(PF \+\d+ laps?)')
pf_counts = pf_laps.groupby(['Driver', 'PF Status']).size().unstack(fill_value=0)
pf_counts = pf_counts.reindex(all_drivers, fill_value=0)
pf_counts.plot(kind='bar', stacked=True, figsize=(12, 6), colormap='coolwarm')
plt.title("Number of Times Each Driver Was Lapped")
plt.ylabel("Count")
plt.xlabel("Driver")
plt.xticks(rotation=45, ha='right')
plt.legend(title="Lapping Status")
plt.tight_layout()
plt.show()
From the graph above we can conclude that most drivers get lapped once, more rarely twice. An outlier is Lando Norris, who got lapped 7 times. Now let's take a look at the tracks themselves. It's interesting to see which ones require more time per lap. We can do that by averaging the drivers' fastest laps for every track:
import pandas as pd
import matplotlib.pyplot as plt
df_laps = df.copy()
df_laps = df_laps[df_laps['Fastest Lap Time'] != 'No time']
def lap_time_to_seconds(x):
minutes, seconds = x.split(':')
return int(minutes) * 60 + float(seconds)
df_laps['Fastest Lap Time (s)'] = df_laps['Fastest Lap Time'].apply(lap_time_to_seconds)
avg_fastest_lap = df_laps.groupby('Track')['Fastest Lap Time (s)'].mean().sort_values()
def seconds_to_lap_time(seconds):
minutes = int(seconds // 60)
secs = seconds % 60
return f"{minutes}:{secs:06.3f}"
plt.figure(figsize=(12,7))
bars = avg_fastest_lap.plot(kind='barh', color='skyblue')
plt.xlabel('Average Fastest Lap Time (seconds)')
plt.title('Average Fastest Lap Time per Race Track')
plt.grid(axis='x')
plt.gca().invert_yaxis()
for index, value in enumerate(avg_fastest_lap):
lap_time_formatted = seconds_to_lap_time(value)
plt.text(value + 0.5, index, lap_time_formatted, va='center')
plt.tight_layout()
plt.show()
Now that we have taken a closer look at the tracks, let's see how each driver performs on each track:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
df_race = df.copy()
df_race['Position'] = pd.to_numeric(df_race['Position'], errors='coerce')
NC_VALUE = 25
df_race['Position_Filled'] = df_race['Position'].fillna(NC_VALUE)
heatmap_data = df_race.pivot_table(index='Driver', columns='Track', values='Position_Filled')
annot_data = heatmap_data.applymap(
lambda x: "NC" if x == NC_VALUE else (f"{int(x)}" if not pd.isna(x) else "")
)
# Plot heatmap
plt.figure(figsize=(16, 10))
sns.heatmap(
heatmap_data,
annot=annot_data,
fmt="",
cmap='YlGnBu_r',
linewidths=0.5,
linecolor='gray',
cbar_kws={'label': 'Finishing Position'},
vmin=1,
vmax=20
)
plt.title('Finishing Positions of Drivers by Race')
plt.xlabel('Track')
plt.ylabel('Driver')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()
Looking at the graph above, the more successful drivers tend to have a row that is almost exclusively blue (blue indicates a higher finish). The drivers with consistently blue rows tend to be members of the "contender" teams from the previous graph: Mercedes, McLaren, Ferrari and Red Bull. Something else important to look into is how positions change between start and finish, and what the trend behind that is.
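One way to put a number behind the "consistently blue rows" observation is a mean finishing position per driver, with NC replaced by the same penalty value (25) used in the heatmap above. A toy illustration with made-up rows (drivers "A" and "B" are hypothetical; on the real `df` the same groupby would rank the actual field):

```python
import pandas as pd

# Toy data, not the real dataset
toy = pd.DataFrame({
    "Driver": ["A", "A", "B", "B"],
    "Position": ["1", "2", "NC", "12"],
})
# Non-numeric positions (NC) become NaN, then the penalty value 25
toy["Position"] = pd.to_numeric(toy["Position"], errors="coerce").fillna(25)
mean_finish = toy.groupby("Driver")["Position"].mean().sort_values()
print(mean_finish)  # A: 1.5, B: 18.5
```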
import seaborn as sns
df_grid = df.copy()
df_grid['Grid'] = pd.to_numeric(df_grid['Starting Grid'], errors='coerce')
df_grid['Position'] = pd.to_numeric(df_grid['Position'], errors='coerce')
df_grid = df_grid.dropna(subset=['Grid', 'Position'])
plt.figure(figsize=(10, 8))
sns.scatterplot(data=df_grid, x='Grid', y='Position', hue='Driver', alpha=0.7, legend=False)
# Diagonal reference line (Grid == Position)
plt.plot([1, 20], [1, 20], 'r--', label='Same Start/Finish')
# Regression line
sns.regplot(data=df_grid, x='Grid', y='Position', scatter=False, color='blue', label='Trend Line')
plt.xlim(1, 20)
plt.ylim(1, 20)
plt.xticks(range(1, 21))
plt.yticks(range(1, 21))
plt.xlabel('Starting Grid Position')
plt.ylabel('Finishing Position')
plt.title('Starting Grid Position vs Finishing Position')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
The red dotted line represents racers who started and ended the race in the same position. The blue line shows the overall trend in the relationship between starting grid and finishing position. The two lines cross at roughly the point (8, 8). We can conclude that racers starting in positions 1-7 tend to lose places, while racers starting 9-20 normally have a better chance of improving their position.
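The same trend can be quantified as the average position change (finish minus start) per grid slot, where positive means places lost and negative means places gained. A hypothetical mini-example (invented rows; running the same groupby on the full `df` would put numbers behind the crossover around P8 seen in the plot):

```python
import pandas as pd

# Invented rows: two starts from P1, two from P15
toy = pd.DataFrame({
    "Starting Grid": [1, 1, 15, 15],
    "Position": [2, 3, 11, 13],
})
# Positive change = places lost, negative = places gained
toy["Change"] = toy["Position"] - toy["Starting Grid"]
print(toy.groupby("Starting Grid")["Change"].mean())
# Starting Grid 1 -> 1.5 (loses places), 15 -> -3.0 (gains places)
```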
Processing
Before we move on to select the features for the model, we need to do some more processing. For my "Absolute Time" and "Fastest Lap Time" I will be working in seconds.
def convert_lap_time_to_seconds(lap_time_str):
    # "M:SS.sss" -> seconds; anything else (e.g. "No time") -> None
    if isinstance(lap_time_str, str) and ':' in lap_time_str:
        minutes, seconds = lap_time_str.split(':')
        return int(minutes) * 60 + float(seconds)
    return None
df['Fastest Lap Time (s)'] = df['Fastest Lap Time'].apply(convert_lap_time_to_seconds)
def convert_absolute_time_to_seconds(time_str):
    # "H:MM:SS.sss" -> seconds; "PF ..." and DNF/DSQ/DNS markers -> None
    if isinstance(time_str, str) and ':' in time_str:
        try:
            h, m, s = time_str.split(':')
            return int(h) * 3600 + int(m) * 60 + float(s)
        except ValueError:
            return None
    return None
df['Absolute Time (s)'] = df['Absolute Time'].apply(convert_absolute_time_to_seconds)
numeric_cols = ['Starting Grid', 'Laps', 'Fastest Lap Time (s)', 'Set Fastest Lap (binary)', 'Position']
df['Set Fastest Lap (binary)'] = df['Set Fastest Lap'].map({'Yes': 1, 'No': 0})
for col in numeric_cols:
df[col] = pd.to_numeric(df[col], errors='coerce')
df_cleaned = df[numeric_cols].dropna()
import seaborn as sns
import matplotlib.pyplot as plt
Qualification Dataset
For optimal detail in my predictions, I will be using the qualifying results for each race to make the predictions for that race. The qualifying information is publicly available on the official Formula 1 site, from which I scraped the data to build my dataset. Let's take a look at the data:
qualifying_df = pd.read_csv("QUALIFYING.csv")
qualifying_df.head(20)
| | Location | Pos | No | Driver | Car | Q1 | Q2 | Q3 | Laps | Team |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Australia | 1 | 4 | Lando Norris | McLaren Mercedes | 1:15.912 | 1:15.415 | 1:15.096 | 20 | McLaren |
| 1 | Australia | 2 | 81 | Oscar Piastri | McLaren Mercedes | 1:16.062 | 1:15.468 | 1:15.180 | 18 | McLaren |
| 2 | Australia | 3 | 1 | Max Verstappen | Red Bull Racing Honda RBPT | 1:16.018 | 1:15.565 | 1:15.481 | 17 | Red Bull Racing |
| 3 | Australia | 4 | 63 | George Russell | Mercedes | 1:15.971 | 1:15.798 | 1:15.546 | 21 | Mercedes |
| 4 | Australia | 5 | 22 | Yuki Tsunoda | Racing Bulls Honda RBPT | 1:16.225 | 1:16.009 | 1:15.670 | 18 | Racing Bulls |
| 5 | Australia | 6 | 23 | Alexander Albon | Williams Mercedes | 1:16.245 | 1:16.017 | 1:15.737 | 21 | Williams |
| 6 | Australia | 7 | 16 | Charles Leclerc | Ferrari | 1:16.029 | 1:15.827 | 1:15.755 | 20 | Ferrari |
| 7 | Australia | 8 | 44 | Lewis Hamilton | Ferrari | 1:16.213 | 1:15.919 | 1:15.973 | 23 | Ferrari |
| 8 | Australia | 9 | 10 | Pierre Gasly | Alpine Renault | 1:16.328 | 1:16.112 | 1:15.980 | 21 | Alpine |
| 9 | Australia | 10 | 55 | Carlos Sainz Jr | Williams Mercedes | 1:16.360 | 1:15.931 | 1:16.062 | 21 | Williams |
| 10 | Australia | 11 | 6 | Isack Hadjar | Racing Bulls Honda RBPT | 1:16.354 | 1:16.175 | NaN | 12 | Racing Bulls |
| 11 | Australia | 12 | 14 | Fernando Alonso | Aston Martin Aramco Mercedes | 1:16.288 | 1:16.453 | NaN | 13 | Aston Martin |
| 12 | Australia | 13 | 18 | Lance Stroll | Aston Martin Aramco Mercedes | 1:16.369 | 1:16.483 | NaN | 15 | Aston Martin |
| 13 | Australia | 14 | 7 | Jack Doohan | Alpine Renault | 1:16.315 | 1:16.863 | NaN | 15 | Alpine |
| 14 | Australia | 15 | 5 | Gabriel Bortoleto | Kick Sauber Ferrari | 1:16.516 | 1:17.520 | NaN | 13 | Kick Sauber |
| 15 | Australia | 16 | 12 | Andrea Kimi Antonelli | Mercedes | 1:16.525 | NaN | NaN | 9 | Mercedes |
| 16 | Australia | 17 | 27 | Nico Hulkenberg | Kick Sauber Ferrari | 1:16.579 | NaN | NaN | 9 | Kick Sauber |
| 17 | Australia | 18 | 30 | Liam Lawson | Red Bull Racing Honda RBPT | 1:17.094 | NaN | NaN | 7 | Red Bull Racing |
| 18 | Australia | 19 | 31 | Esteban Ocon | Haas Ferrari | 1:17.147 | NaN | NaN | 9 | Haas |
| 19 | Australia | NC | 87 | Oliver Bearman | Haas Ferrari | DNS | NaN | NaN | 1 | Haas |
Something we can notice is that not all drivers have a time registered for Q2 or Q3, and the number of laps is uneven. The reason: the qualification process is divided into three rounds, Q1, Q2 and Q3. All 20 drivers take part in Q1, which lasts 18 minutes, but the slowest 5 are eliminated. Q2 lasts 15 minutes with the remaining 15 drivers, and again the slowest 5 are eliminated. Finally, Q3 lasts 12 minutes and features the fastest 10 drivers. Drivers are judged on the fastest lap they can produce, which is recorded for them in each respective qualifying round.
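The knockout format described above can be sketched with made-up lap times (driver names and times are invented; each round, drivers are ranked by their fastest lap and the slowest five drop out):

```python
def run_round(times, eliminate=5):
    """times: {driver: fastest lap in seconds} -> (advancing, eliminated)."""
    ranked = sorted(times, key=times.get)  # fastest first
    return ranked[:-eliminate], ranked[-eliminate:]

# 20 fake drivers with evenly spaced fake times
q1 = {f"Driver {i:02d}": 90 + 0.1 * i for i in range(20)}
q2_field, out_q1 = run_round(q1)                            # 15 advance to Q2
q3_field, out_q2 = run_round({d: q1[d] for d in q2_field})  # 10 advance to Q3
print(len(q2_field), len(q3_field))  # 15 10
```

This also explains the NaN pattern in the table above: a driver eliminated in Q1 simply never sets a Q2 or Q3 time.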
The model created and trained above uses the following features: 'Starting Grid', 'Laps', 'Fastest Lap Time (s)' and 'Set Fastest Lap (binary)'. We can get the starting grid from the qualifying results. The number of laps in each race is publicly available information, so we can "hard code" it per race. The fastest lap time, and whether a driver set the fastest lap, can also be derived from the qualifying results. The idea now is to feed the same features to the model to make the predictions. Before that, though, we should do a bit of processing.
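Hard-coding the lap count could look like the sketch below. Only an illustrative subset of the calendar is shown (the Bahrain and Abu Dhabi values are consistent with the race results tables earlier; extend the table with the remaining races as needed):

```python
# Race distance is fixed per Grand Prix, so a lookup table works.
# Illustrative subset only -- not the full calendar.
RACE_LAPS = {
    "Bahrain": 57,
    "Monaco": 78,
    "Abu Dhabi": 58,
}

def laps_for(track):
    # A KeyError on a missing track is preferable to a silent guess
    return RACE_LAPS[track]

print(laps_for("Bahrain"))  # 57
```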
def time_to_seconds(time_str):
    # "M:SS.sss" -> seconds; anything unparseable stays "No Time"
    try:
        minutes, rest = time_str.split(':')
        return int(minutes) * 60 + float(rest)
    except (AttributeError, ValueError):
        return "No Time"
for session in ['Q1', 'Q2', 'Q3']:
    qualifying_df[session] = qualifying_df[session].fillna("No Time")
    qualifying_df[session] = qualifying_df[session].apply(time_to_seconds)
qualifying_df.tail(20)
| | Location | Pos | No | Driver | Car | Q1 | Q2 | Q3 | Laps | Team |
|---|---|---|---|---|---|---|---|---|---|---|
| 100 | Miami | 1 | 1 | Max Verstappen | Red Bull Racing Honda RBPT | 86.87 | 86.643 | 86.204 | 18 | Red Bull Racing |
| 101 | Miami | 2 | 4 | Lando Norris | McLaren Mercedes | 86.955 | 86.499 | 86.269 | 21 | McLaren |
| 102 | Miami | 3 | 12 | Andrea Kimi Antonelli | Mercedes | 87.077 | 86.606 | 86.271 | 20 | Mercedes |
| 103 | Miami | 4 | 81 | Oscar Piastri | McLaren Mercedes | 87.006 | 86.269 | 86.375 | 16 | McLaren |
| 104 | Miami | 5 | 63 | George Russell | Mercedes | 87.014 | 86.575 | 86.385 | 20 | Mercedes |
| 105 | Miami | 6 | 55 | Carlos Sainz Jr | Williams Mercedes | 87.098 | 86.847 | 86.569 | 20 | Williams |
| 106 | Miami | 7 | 23 | Alexander Albon | Williams Mercedes | 87.042 | 86.855 | 86.682 | 20 | Williams |
| 107 | Miami | 8 | 16 | Charles Leclerc | Ferrari | 87.417 | 86.948 | 86.754 | 20 | Ferrari |
| 108 | Miami | 9 | 31 | Esteban Ocon | Haas Ferrari | 87.45 | 86.967 | 86.824 | 21 | Haas |
| 109 | Miami | 10 | 22 | Yuki Tsunoda | Red Bull Racing Honda RBPT | 87.298 | 86.959 | 86.943 | 21 | Red Bull Racing |
| 110 | Miami | 11 | 6 | Isack Hadjar | Racing Bulls Honda RBPT | 87.301 | 86.987 | No Time | 13 | Racing Bulls |
| 111 | Miami | 12 | 44 | Lewis Hamilton | Ferrari | 87.279 | 87.006 | No Time | 15 | Ferrari |
| 112 | Miami | 13 | 5 | Gabriel Bortoleto | Kick Sauber Ferrari | 87.343 | 87.151 | No Time | 15 | Kick Sauber |
| 113 | Miami | 14 | 7 | Jack Doohan | Alpine Renault | 87.422 | 87.186 | No Time | 15 | Alpine |
| 114 | Miami | 15 | 30 | Liam Lawson | Racing Bulls Honda RBPT | 87.444 | 87.363 | No Time | 14 | Racing Bulls |
| 115 | Miami | 16 | 27 | Nico Hulkenberg | Kick Sauber Ferrari | 87.473 | No Time | No Time | 9 | Kick Sauber |
| 116 | Miami | 17 | 14 | Fernando Alonso | Aston Martin Aramco Mercedes | 87.604 | No Time | No Time | 9 | Aston Martin |
| 117 | Miami | 18 | 10 | Pierre Gasly | Alpine Renault | 87.71 | No Time | No Time | 9 | Alpine |
| 118 | Miami | 19 | 18 | Lance Stroll | Aston Martin Aramco Mercedes | 87.83 | No Time | No Time | 9 | Aston Martin |
| 119 | Miami | 20 | 87 | Oliver Bearman | Haas Ferrari | 87.999 | No Time | No Time | 9 | Haas |
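The conversion helper used above can be sanity-checked on standalone strings. It is restated here (with a narrower `except` clause) so the snippet runs on its own:

```python
# Restated copy of the notebook's lap-time helper, for a quick standalone check.
def time_to_seconds(time_str):
    try:
        minutes, rest = time_str.split(':')
        return int(minutes) * 60 + float(rest)
    except (ValueError, AttributeError):
        return "No Time"

print(time_to_seconds("1:26.204"))  # 1 minute + 26.204 s = 86.204 s
print(time_to_seconds("No Time"))   # missing laps pass through unchanged
```

This is why Verstappen's Q3 time of 1:26.204 appears as 86.204 in the table above.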
Feature Selection¶
Here are the features that I will be using in my model: Starting Grid, Laps, Fastest Time and Absolute Time (both in seconds), and Set Fastest Lap (binary). The ones I am not using are: No, Driver, Team, Time/Retired and Points. A driver's car number does not affect their performance, and neither does their name or their team's name. I am already using the absolute time, so there is no need for Time/Retired, and the points are a result of the race, awarded after the fact, so they cannot be used for a prediction. Let's look into how the selected features relate to one another, using a heatmap and a scatter matrix.
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.heatmap(df_cleaned[numeric_cols].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap of Formula 1 Metrics")
plt.tight_layout()
plt.show()

sns.pairplot(df_cleaned, corner=False, plot_kws={'alpha': 0.6, 's': 40})
plt.suptitle("Scatter Matrix of Formula 1 Features", y=1.02)
plt.show()
Splitting into Train/Test¶
from sklearn.model_selection import train_test_split
X = df_cleaned[['Starting Grid', 'Laps', 'Fastest Lap Time (s)', 'Set Fastest Lap (binary)']]
y = df_cleaned['Position']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")
There are in total 431 observations, of which 344 are now in the train set, and 87 in the test set.
Scaling¶
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
Modeling¶
For this project I will go through several model types to see which one works best: K-Nearest Neighbours, Linear Regression, Decision Trees, Support Vector Machines and Random Forests. After we look at each base model, we can enhance it with hyperparameter tuning or with AdaBoost boosting.
K-Nearest Neighbours:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
y_train_clean = y_train.copy()
y_test_clean = y_test.copy()
knn_regressor = KNeighborsRegressor(n_neighbors=5)
knn_regressor.fit(X_train_scaled, y_train_clean)
y_pred_knn = knn_regressor.predict(X_test_scaled)
mae_knn = mean_absolute_error(y_test_clean, y_pred_knn)
mse_knn = mean_squared_error(y_test_clean, y_pred_knn)
r2_knn = r2_score(y_test_clean, y_pred_knn)
print(f"KNN Mean Absolute Error (MAE): {mae_knn}")
print(f"KNN Mean Squared Error (MSE): {mse_knn}")
print(f"KNN R-squared (R²): {r2_knn}")
KNN Mean Absolute Error (MAE): 3.2896551724137924
KNN Mean Squared Error (MSE): 20.878620689655172
KNN R-squared (R²): 0.30693945214851437
The R² here is around 31%, which can definitely be improved upon. Let's try doing so with hyperparameter tuning:
After hyperparameter tuning:
from sklearn.model_selection import GridSearchCV
param_grid = {
'n_neighbors': [3, 5, 7, 9],
'weights': ['uniform', 'distance'],
'p': [1, 2]
}
knn = KNeighborsRegressor()
grid_search = GridSearchCV(
estimator=knn,
param_grid=param_grid,
cv=5,
scoring='neg_mean_absolute_error',
n_jobs=-1
)
grid_search.fit(X_train_scaled, y_train_clean)
best_knn = grid_search.best_estimator_
y_pred_best = best_knn.predict(X_test_scaled)
print("Best hyperparameters:", grid_search.best_params_)
print("Tuned MAE:", mean_absolute_error(y_test_clean, y_pred_best))
print("Tuned MSE:", mean_squared_error(y_test_clean, y_pred_best))
print("Tuned R²:", r2_score(y_test_clean, y_pred_best))
Best hyperparameters: {'n_neighbors': 9, 'p': 2, 'weights': 'uniform'}
Tuned MAE: 3.0434227330779047
Tuned MSE: 19.161770966368664
Tuned R²: 0.3639298456944432
Now the R² has improved by about 6 percentage points, but that still isn't what we are aiming for. Let's try some other models.
Linear Regression:
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
linear_model.fit(X_train_scaled, y_train_clean)
y_pred_linear = linear_model.predict(X_test_scaled)
mae_linear = mean_absolute_error(y_test_clean, y_pred_linear)
mse_linear = mean_squared_error(y_test_clean, y_pred_linear)
r2_linear = r2_score(y_test_clean, y_pred_linear)
print("Linear Regression Metrics:")
print(f"MAE: {mae_linear}")
print(f"MSE: {mse_linear}")
print(f"R²: {r2_linear}")
Linear Regression Metrics:
MAE: 2.8148606730396013
MSE: 17.51357331379357
R²: 0.418641351068321
This linear regression model has an R² of around 42%, making it a better option than K-Nearest Neighbours. Can we build on it, though? I will use AdaBoost to try for an even higher result:
from sklearn.ensemble import AdaBoostRegressor
from sklearn.linear_model import LinearRegression
base_model = LinearRegression()
adaboost_model = AdaBoostRegressor(base_model, n_estimators=50, random_state=1)
adaboost_model.fit(X_train_scaled, y_train_clean)
y_pred_adaboost = adaboost_model.predict(X_test_scaled)
mae_adaboost = mean_absolute_error(y_test_clean, y_pred_adaboost)
mse_adaboost = mean_squared_error(y_test_clean, y_pred_adaboost)
r2_adaboost = r2_score(y_test_clean, y_pred_adaboost)
print("AdaBoost with Linear Regression Metrics:")
print(f"MAE: {mae_adaboost}")
print(f"MSE: {mse_adaboost}")
print(f"R²: {r2_adaboost}")
AdaBoost with Linear Regression Metrics:
MAE: 2.85748642122476
MSE: 16.71738732850535
R²: 0.4450705440383784
At 44%, this is the highest R² so far. Let's look at more model types:
Decision Trees (Regressor):
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
tree_model = DecisionTreeRegressor(random_state=1)
tree_model.fit(X_train_scaled, y_train_clean)
y_pred_tree = tree_model.predict(X_test_scaled)
mae_tree = mean_absolute_error(y_test_clean, y_pred_tree)
mse_tree = mean_squared_error(y_test_clean, y_pred_tree)
r2_tree = r2_score(y_test_clean, y_pred_tree)
print("Decision Tree Regressor Metrics:")
print(f"MAE: {mae_tree}")
print(f"MSE: {mse_tree}")
print(f"R²: {r2_tree}")
Decision Tree Regressor Metrics:
MAE: 3.781609195402299
MSE: 24.93103448275862
R²: 0.1724205983738124
This model gives an R² of only 17%, which is surprisingly low. A likely cause is overfitting: an unconstrained tree memorises the training data and generalises poorly. Let's try to fix that with some more hyperparameter tuning.
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
param_grid = {
'max_depth': [3],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
tree = DecisionTreeRegressor(random_state=1)
grid_search = GridSearchCV(
estimator=tree,
param_grid=param_grid,
cv=5,
scoring='neg_mean_absolute_error',
n_jobs=-1
)
grid_search.fit(X_train_scaled, y_train_clean)
best_tree = grid_search.best_estimator_
y_pred_best_tree = best_tree.predict(X_test_scaled)
print("Best Parameters:", grid_search.best_params_)
print("Tuned MAE:", mean_absolute_error(y_test_clean, y_pred_best_tree))
print("Tuned MSE:", mean_squared_error(y_test_clean, y_pred_best_tree))
print("Tuned R²:", r2_score(y_test_clean, y_pred_best_tree))
Best Parameters: {'max_depth': 3, 'min_samples_leaf': 4, 'min_samples_split': 2}
Tuned MAE: 3.1079737085862322
Tuned MSE: 18.65229382431931
Tuned R²: 0.38084181092601066
We can see a jump from 17% to 38%, which is quite significant. The original tree was overfitting, which can be mitigated by limiting max_depth (in this case the optimal value is 3). Let's look at more:
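The overfitting story can be checked directly by comparing train and test R². Here is a sketch on synthetic, made-up data; in the notebook the same comparison would use `tree_model` and `best_tree` on `X_train_scaled`/`X_test_scaled`:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data: one informative feature plus noise
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X[:, 0] * 3 + rng.normal(scale=2.0, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)

deep = DecisionTreeRegressor(random_state=1).fit(X_tr, y_tr)
shallow = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X_tr, y_tr)

# An unconstrained tree memorises the training noise: near-perfect train R²,
# much weaker test R². Capping max_depth narrows that gap.
print(f"deep:    train R2={deep.score(X_tr, y_tr):.2f}, test R2={deep.score(X_te, y_te):.2f}")
print(f"shallow: train R2={shallow.score(X_tr, y_tr):.2f}, test R2={shallow.score(X_te, y_te):.2f}")
```

A large gap between train and test R² for the unconstrained tree is the classic overfitting signature the tuned model avoids.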
Support Vector Regressor:
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
svr_model = SVR()
svr_model.fit(X_train_scaled, y_train_clean)
y_pred_svr = svr_model.predict(X_test_scaled)
mae_svr = mean_absolute_error(y_test_clean, y_pred_svr)
mse_svr = mean_squared_error(y_test_clean, y_pred_svr)
r2_svr = r2_score(y_test_clean, y_pred_svr)
print("Support Vector Regressor Metrics:")
print(f"MAE: {mae_svr}")
print(f"MSE: {mse_svr}")
print(f"R²: {r2_svr}")
Support Vector Regressor Metrics:
MAE: 2.851083430866835
MSE: 17.191099331171962
R²: 0.42934579358804736
This one gives a pretty solid result compared to the others, but I believe we can still go higher.
Random forest:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
rf_model = RandomForestRegressor(n_estimators=100, max_depth=None, random_state=42, n_jobs=-1)
rf_model.fit(X_train_scaled, y_train_clean)
y_pred_rf = rf_model.predict(X_test_scaled)
mae_rf = mean_absolute_error(y_test_clean, y_pred_rf)
mse_rf = mean_squared_error(y_test_clean, y_pred_rf)
r2_rf = r2_score(y_test_clean, y_pred_rf)
print("Random Forest Regressor Metrics:")
print(f"MAE: {mae_rf}")
print(f"MSE: {mse_rf}")
print(f"R²: {r2_rf}")
Random Forest Regressor Metrics:
MAE: 2.929655172413793
MSE: 17.43067816091954
R²: 0.42139303476041345
The Random Forest base model also gives us a result of around 42%. Let's tune it to see how high we can get:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
param_grid = {
'n_estimators': [100, 200, 300],
'max_depth': [None, 10, 20, 30],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': ['sqrt', 'log2', None],
'bootstrap': [True, False]
}
rf_model = RandomForestRegressor(random_state=42, n_jobs=-1)
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='neg_mean_absolute_error')
grid_search.fit(X_train_scaled, y_train_clean)
best_params = grid_search.best_params_
print("Best hyperparameters:", best_params)
best_rf_model = grid_search.best_estimator_
y_pred_rf = best_rf_model.predict(X_test_scaled)
mae_rf = mean_absolute_error(y_test_clean, y_pred_rf)
mse_rf = mean_squared_error(y_test_clean, y_pred_rf)
r2_rf = r2_score(y_test_clean, y_pred_rf)
print("Random Forest Regressor Metrics with Hyperparameter Tuning:")
print(f"MAE: {mae_rf}")
print(f"MSE: {mse_rf}")
print(f"R²: {r2_rf}")
Best hyperparameters: {'bootstrap': True, 'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 100}
Random Forest Regressor Metrics with Hyperparameter Tuning:
MAE: 2.8414209708132074
MSE: 15.847555533117363
R²: 0.4739443910999773
After comparing the models, this last one gives the highest result, so we will use it for our predictions.
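Before moving on, one quick sanity check is to see which features the chosen model actually leans on, via its `feature_importances_`. The snippet below is a self-contained sketch: it trains a stand-in forest on synthetic data with the same four feature names, where the target is deliberately dominated by the grid position; in the notebook you would simply read `best_rf_model.feature_importances_` directly:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Same feature names as the notebook's model; the data here is synthetic.
features = ['Starting Grid', 'Laps', 'Fastest Lap Time (s)', 'Set Fastest Lap (binary)']
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=features)
y = X['Starting Grid'] * 0.8 + rng.normal(scale=0.3, size=200)  # grid dominates by construction

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)

# Importances sum to 1; higher means the feature drives more of the splits
for name, imp in sorted(zip(features, rf.feature_importances_), key=lambda t: -t[1]):
    print(f"{name:<28}{imp:.3f}")
```

If the real model's importances put most of the weight on Starting Grid and Fastest Lap Time, that matches the intuition that qualifying pace largely decides race results.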
Prediction for the Races¶
import numpy as np

post_race_results_df = pd.read_csv("RESULTS.csv")
starting_grid_df = pd.read_csv("STARTING_GRID.csv")

all_predictions = []
results_for_csv = []

for location in qualifying_df['Location'].unique():
    print(f"\n🏁 Predictions for {location}\n")
    race_df = qualifying_df[qualifying_df['Location'] == location].copy()

    # Look up each driver's starting grid position for this race
    starting_grids = []
    for _, row in race_df.iterrows():
        driver = row['Driver']
        match = starting_grid_df[
            (starting_grid_df['Driver'] == driver) &
            (starting_grid_df['Location'] == location)
        ]
        if not match.empty:
            starting_grids.append(int(match.iloc[0]['Pos']))
        else:
            starting_grids.append(np.nan)
    race_df['Starting Grid'] = starting_grids

    # Build the model's features from the qualifying sessions
    race_df[['Q1', 'Q2', 'Q3']] = race_df[['Q1', 'Q2', 'Q3']].replace('No Time', np.nan).astype(float)
    race_df['Fastest Lap Time (s)'] = race_df[['Q1', 'Q2', 'Q3']].min(axis=1)
    race_df['Laps'] = 58  # race distance, hard-coded per race as described above
    race_df['Set Fastest Lap (binary)'] = 0
    fastest_driver_idx = race_df['Fastest Lap Time (s)'].idxmin()
    race_df.loc[fastest_driver_idx, 'Set Fastest Lap (binary)'] = 1

    features = ['Starting Grid', 'Laps', 'Fastest Lap Time (s)', 'Set Fastest Lap (binary)']
    predictable_df = race_df.dropna(subset=features).copy()
    unpredictable_df = race_df[~race_df.index.isin(predictable_df.index)].copy()

    # Predict, then turn the raw model output into a 1..20 ranking
    X_qual = predictable_df[features]
    X_qual_scaled = scaler.transform(X_qual)
    predicted_positions = best_rf_model.predict(X_qual_scaled)
    predictable_df['Predicted Position'] = predicted_positions
    predictable_df['Predicted Position'] = predictable_df['Predicted Position'].rank(method='first').astype(int)

    # Drivers with missing features are placed at the back of the ranking
    if not unpredictable_df.empty:
        start_rank = predictable_df['Predicted Position'].max() + 1
        unpredictable_df['Predicted Position'] = range(start_rank, start_rank + len(unpredictable_df))

    final_race_df = pd.concat([predictable_df, unpredictable_df], ignore_index=True)
    final_race_df = final_race_df.sort_values(by='Predicted Position').reset_index(drop=True)
    final_race_df['Predicted Position'] = final_race_df['Predicted Position'].astype(int)

    print(f"{'Driver':<25}{'Team':<20}{'Grid':<8}{'Predicted':<12}{'Actual':<10}{'Difference'}")
    for _, row in final_race_df.iterrows():
        driver = row['Driver']
        team = row['Team']
        grid = int(row['Starting Grid']) if pd.notna(row['Starting Grid']) else "N/A"
        predicted = row['Predicted Position']
        match = post_race_results_df[
            (post_race_results_df['Driver'] == driver) &
            (post_race_results_df['Location'] == location)
        ]
        actual = match.iloc[0]['Pos'] if not match.empty else "N/A"

        # Non-classified (NC), disqualified (DQ) or missing results can't be compared
        if str(actual).upper() in ('NC', 'DQ', 'N/A'):
            difference = 'X'
        else:
            try:
                actual = int(actual)
                predicted = int(row['Predicted Position'])
                if actual == predicted:
                    difference = '✔'
                elif actual < predicted:
                    difference = f"+{predicted - actual}"
                else:
                    difference = f"-{actual - predicted}"
            except (ValueError, TypeError):
                difference = 'X'
                actual = 'Error'

        results_for_csv.append({
            'Driver': driver,
            'Team': team,
            'Grid': grid,
            'Predicted': predicted,
            'Actual': actual,
            'Difference': difference,
            'Location': location
        })
        print(f"{driver:<25}{team:<20}{grid:<8}{predicted:<12}{actual:<10}{difference}")

    print("-" * 50)
    all_predictions.append(final_race_df)

results_df = pd.DataFrame(results_for_csv)
results_df.to_csv("website/data/final_individual_rankings.csv", index=False)

import joblib
joblib.dump(best_rf_model, "website/model/Individual_Predictions.pkl")
🏁 Predictions for Australia Driver Team Grid Predicted Actual Difference Oscar Piastri McLaren 2 1 9 -8 Lando Norris McLaren 1 2 1 +1 Max Verstappen Red Bull Racing 3 3 2 +1 George Russell Mercedes 4 4 3 +1 Yuki Tsunoda Racing Bulls 5 5 12 -7 Alexander Albon Williams 6 6 5 +1 Pierre Gasly Alpine 9 7 11 -4 Charles Leclerc Ferrari 7 8 8 ✔ Lewis Hamilton Ferrari 8 9 10 -1 Carlos Sainz Jr Williams 10 10 NC X Isack Hadjar Racing Bulls 11 11 NC X Fernando Alonso Aston Martin 12 12 NC X Lance Stroll Aston Martin 13 13 6 +7 Liam Lawson Red Bull Racing 18 14 NC X Esteban Ocon Haas 19 15 13 +2 Gabriel Bortoleto Kick Sauber 15 16 NC X Jack Doohan Alpine 14 17 NC X Andrea Kimi Antonelli Mercedes 16 18 4 +14 Nico Hulkenberg Kick Sauber 17 19 7 +12 Oliver Bearman Haas 20 20 14 +6 -------------------------------------------------- 🏁 Predictions for China Driver Team Grid Predicted Actual Difference Oscar Piastri McLaren 1 1 1 ✔ Lewis Hamilton Ferrari 5 2 DQ X Lando Norris McLaren 3 3 2 +1 Max Verstappen Red Bull Racing 4 4 4 ✔ George Russell Mercedes 2 5 3 +2 Isack Hadjar Racing Bulls 7 6 11 -5 Andrea Kimi Antonelli Mercedes 8 7 6 +1 Charles Leclerc Ferrari 6 8 DQ X Yuki Tsunoda Racing Bulls 9 9 16 -7 Alexander Albon Williams 10 10 7 +3 Gabriel Bortoleto Kick Sauber 19 11 14 -3 Carlos Sainz Jr Williams 15 12 10 +2 Jack Doohan Alpine 18 13 13 ✔ Fernando Alonso Aston Martin 13 14 NC X Oliver Bearman Haas 17 15 8 +7 Pierre Gasly Alpine 16 16 DQ X Liam Lawson Red Bull Racing 20 17 12 +5 Nico Hulkenberg Kick Sauber 12 18 15 +3 Lance Stroll Aston Martin 14 19 9 +10 Esteban Ocon Haas 11 20 5 +15 -------------------------------------------------- 🏁 Predictions for Japan Driver Team Grid Predicted Actual Difference Max Verstappen Red Bull Racing 1 1 1 ✔ George Russell Mercedes 5 2 5 -3 Oscar Piastri McLaren 3 3 3 ✔ Charles Leclerc Ferrari 4 4 4 ✔ Lando Norris McLaren 2 5 2 +3 Isack Hadjar Racing Bulls 7 6 8 -2 Lewis Hamilton Ferrari 8 7 7 ✔ Andrea Kimi Antonelli Mercedes 6 8 6 +2 
Alexander Albon Williams 9 9 9 ✔ Jack Doohan Alpine 19 10 15 -5 Esteban Ocon Haas 18 11 18 -7 Oliver Bearman Haas 10 12 10 +2 Lance Stroll Aston Martin 20 13 20 -7 Carlos Sainz Jr Williams 15 14 14 ✔ Gabriel Bortoleto Kick Sauber 17 15 19 -4 Nico Hulkenberg Kick Sauber 16 16 16 ✔ Liam Lawson Racing Bulls 13 17 17 ✔ Yuki Tsunoda Red Bull Racing 14 18 12 +6 Fernando Alonso Aston Martin 12 19 11 +8 Pierre Gasly Alpine 11 20 13 +7 -------------------------------------------------- 🏁 Predictions for Bahrain Driver Team Grid Predicted Actual Difference Oscar Piastri McLaren 1 1 1 ✔ Andrea Kimi Antonelli Mercedes 5 2 11 -9 George Russell Mercedes 3 3 2 +1 Pierre Gasly Alpine 4 4 7 -3 Charles Leclerc Ferrari 2 5 4 +1 Lando Norris McLaren 6 6 3 +3 Max Verstappen Red Bull Racing 7 7 6 +1 Carlos Sainz Jr Williams 8 8 NC X Lewis Hamilton Ferrari 9 9 5 +4 Yuki Tsunoda Red Bull Racing 10 10 9 +1 Lance Stroll Aston Martin 19 11 17 -6 Alexander Albon Williams 15 12 12 ✔ Gabriel Bortoleto Kick Sauber 18 13 18 -5 Liam Lawson Racing Bulls 17 14 16 -2 Nico Hulkenberg Kick Sauber 16 15 DQ X Oliver Bearman Haas 20 16 10 +6 Fernando Alonso Aston Martin 13 17 15 +2 Isack Hadjar Racing Bulls 12 18 13 +5 Esteban Ocon Haas 14 19 8 +11 Jack Doohan Alpine 11 20 14 +6 -------------------------------------------------- 🏁 Predictions for Saudi Arabia Driver Team Grid Predicted Actual Difference Max Verstappen Red Bull Racing 1 1 2 -1 Andrea Kimi Antonelli Mercedes 5 2 6 -4 George Russell Mercedes 3 3 5 -2 Oscar Piastri McLaren 2 4 1 +3 Charles Leclerc Ferrari 4 5 3 +2 Lewis Hamilton Ferrari 7 6 7 -1 Yuki Tsunoda Red Bull Racing 8 7 NC X Carlos Sainz Jr Williams 6 8 8 ✔ Pierre Gasly Alpine 9 9 NC X Esteban Ocon Haas 19 10 14 -4 Lando Norris McLaren 10 11 4 +7 Nico Hulkenberg Kick Sauber 18 12 15 -3 Gabriel Bortoleto Kick Sauber 20 13 18 -5 Oliver Bearman Haas 15 14 13 +1 Jack Doohan Alpine 17 15 17 -2 Lance Stroll Aston Martin 16 16 16 ✔ Fernando Alonso Aston Martin 13 17 11 +6 Isack Hadjar Racing 
Bulls 14 18 10 +8 Liam Lawson Racing Bulls 12 19 12 +7 Alexander Albon Williams 11 20 9 +11 -------------------------------------------------- 🏁 Predictions for Miami Driver Team Grid Predicted Actual Difference Max Verstappen Red Bull Racing 1 1 4 -3 George Russell Mercedes 5 2 3 -1 Andrea Kimi Antonelli Mercedes 3 3 6 -3 Oscar Piastri McLaren 4 4 1 +3 Lando Norris McLaren 2 5 2 +3 Alexander Albon Williams 7 6 5 +1 Charles Leclerc Ferrari 8 7 7 ✔ Oliver Bearman Haas 19 8 NC X Carlos Sainz Jr Williams 6 9 9 ✔ Esteban Ocon Haas 9 10 12 -2 Lance Stroll Aston Martin 18 11 16 -5 Yuki Tsunoda Red Bull Racing 10 12 10 +2 Pierre Gasly Alpine 20 13 13 ✔ Fernando Alonso Aston Martin 17 14 15 -1 Nico Hulkenberg Kick Sauber 16 15 14 +1 Liam Lawson Racing Bulls 15 16 NC X Isack Hadjar Racing Bulls 11 17 11 +6 Gabriel Bortoleto Kick Sauber 13 18 NC X Lewis Hamilton Ferrari 12 19 8 +11 Jack Doohan Alpine 14 20 NC X --------------------------------------------------
Team Results¶
In Formula 1, points are awarded based on finishing position in the race, with different point values assigned to the top 10 finishers; drivers finishing outside the top 10 are not awarded any points. The current F1 points distribution is as follows:
1st place: 25 points
2nd place: 18 points
3rd place: 15 points
4th place: 12 points
5th place: 10 points
6th place: 8 points
7th place: 6 points
8th place: 4 points
9th place: 2 points
10th place: 1 point
import pandas as pd
real_results_df = pd.read_csv("TEAM_RESULTS_2025.csv", header=None, names=["Real Team Position", "Team", "Real Points"])
points_dict = {
1: 25, 2: 18, 3: 15, 4: 12, 5: 10,
6: 8, 7: 6, 8: 4, 9: 2, 10: 1
}
def get_points(pos):
    return points_dict.get(pos, 0)  # positions outside the top 10 score 0
combined_df = pd.concat(all_predictions, ignore_index=True)
combined_df['Predicted Points'] = combined_df['Predicted Position'].apply(get_points)
total_team_points = combined_df.groupby('Team', as_index=False)['Predicted Points'].sum()
total_team_points['Team Rank'] = total_team_points['Predicted Points'].rank(method='first', ascending=False).astype(int)
total_team_points = total_team_points.sort_values('Team Rank')
merged_df = pd.merge(total_team_points, real_results_df, on="Team", how="left")
merged_df['Real Team Position'] = pd.to_numeric(merged_df['Real Team Position'], errors='coerce')
merged_df['Team Rank'] = pd.to_numeric(merged_df['Team Rank'], errors='coerce')
if merged_df['Real Team Position'].isna().any() or merged_df['Team Rank'].isna().any():
    print("Warning: NaN values found in 'Real Team Position' or 'Team Rank'.")
    print(merged_df[merged_df['Real Team Position'].isna() | merged_df['Team Rank'].isna()])

def get_accuracy(row):
    try:
        actual = row['Real Team Position']
        predicted = row['Team Rank']
        if pd.isna(actual) or pd.isna(predicted):
            return 'X'
        if actual == predicted:
            return '✔'
        elif actual < predicted:
            return f"+{predicted - actual}"
        else:
            return f"-{actual - predicted}"
    except Exception as e:
        print(f"Error in accuracy calculation: {e}")
        return 'X'
merged_df['Accuracy'] = merged_df.apply(get_accuracy, axis=1)
merged_df = merged_df[['Team', 'Predicted Points', 'Team Rank', 'Real Points', 'Real Team Position', 'Accuracy']]
merged_df.to_csv("website/data/final_team_rankings.csv", index=False)
merged_df.head(15)
| Team | Predicted Points | Team Rank | Real Points | Real Team Position | Accuracy | |
|---|---|---|---|---|---|---|
| 0 | McLaren | 175 | 1 | 246 | 1 | ✔ |
| 1 | Mercedes | 149 | 2 | 141 | 2 | ✔ |
| 2 | Red Bull Racing | 115 | 3 | 105 | 3 | ✔ |
| 3 | Ferrari | 82 | 4 | 94 | 4 | ✔ |
| 4 | Williams | 30 | 5 | 37 | 5 | ✔ |
| 5 | Racing Bulls | 28 | 6 | 8 | 8 | -2 |
| 6 | Alpine | 21 | 7 | 7 | 9 | -2 |
| 7 | Haas | 6 | 8 | 20 | 6 | +2 |
| 8 | Aston Martin | 0 | 9 | 14 | 7 | +2 |
| 9 | Kick Sauber | 0 | 10 | 6 | 10 | ✔ |
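One compact way to summarise the table above is a rank correlation between predicted and real team positions. The sketch below copies the ten ranks from the output and applies the standard Spearman formula for distinct ranks (`scipy.stats.spearmanr` would give the same number in one call):

```python
# Team ranks from the table above: predicted rank vs real championship position
predicted = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
real      = [1, 2, 3, 4, 5, 8, 9, 6, 7, 10]

# Spearman's rho for distinct ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1))
n = len(predicted)
d2 = sum((p - r) ** 2 for p, r in zip(predicted, real))
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(f"Spearman rho = {rho:.3f}")  # -> Spearman rho = 0.903
```

A rho above 0.9 confirms what the Accuracy column suggests: the model gets the overall team ordering largely right, with only the midfield teams shuffled.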
Final Conclusions¶
Data Analysis
In this project we looked at the data from the 2024 season and analysed it with the help of graphs. We looked into the drivers' and teams' performances from last season to see if there are favourites for the season we are trying to predict. We looked at the tracks with the most NCs and the average gap between the drivers on each track to get a better idea of which tracks might prove easier or more challenging. We also looked at each driver's average time to get a sense of how close the competition in Formula 1 is. After examining the drivers and the tracks, we studied each driver's performance on each track and the relationship between starting and finishing position and how it changes through the grid. The visualisation part of this notebook is important because it lets us analyse the data before we get into the modeling.
Modeling
We looked into different types of models: K-Nearest Neighbours, Linear Regression, Decision Trees, Support Vector Regressor and Random Forest. We applied boosting with AdaBoost or hyperparameter tuning to maximize their performance, and eventually settled on the Random Forest model. After making the predictions, we can compare them with the real-life outcomes either in the notebook or in the website dashboard.
Domain Analysis
After completing the modeling stage, it's important to reflect on the broader implications of using AI in a high-stakes, data-driven sport like Formula 1. While our prediction model serves as a useful prototype to understand patterns in team and driver performance, it also raises an important question: Should we blindly trust AI? The short answer is no — especially not in isolation.
AI can provide valuable insights, uncover trends, and assist with data-heavy decisions, but it lacks the nuance, intuition, and contextual awareness that human strategists bring to the table. In practice, especially for an F1 team, AI should be seen as an advisor, not a decision-maker. It can help simulate outcomes, evaluate probabilities, and reduce human error in some cases — but final calls should always involve expert judgment, especially given the unpredictable nature of racing.
Looking forward, the role of AI in Formula 1 will almost certainly grow. As data collection improves, models will become more and more accurate, possibly even predictive on a per-lap basis. In the context of our project, the model has shown potential and offers a foundation for future development. With more detailed data and advanced features, it could become a practical tool for analysts, teams, and fans alike. But as far as my honest advice goes? Treat AI as a co-pilot, not the driver.